(54) Speech recognition method and apparatus



(57) Speaker-independent speech recognition is performed with high accuracy without setting any recognition unit, such as a triphone, while taking the environment dependency of phonemes into consideration. A word dictionary unit 10 stores phoneme symbol series of a plurality of recognition subject words. A transition probability memory unit 20 stores transition probabilities associated with the N x N mutual state transitions of N states given sequential numbers. An output probability memory unit 30 stores phoneme symbol output probabilities and feature vector output probabilities associated with the respective state transitions. A word comparing unit 40 calculates probabilities of sets of an unknown input speech feature vector time series and hypothetical recognition subject words. A recognition result output unit 50 provides the highest-probability word among all the recognition subject words as the result of recognition.

FIG. 1
Description



The present invention relates to a speech recognition method and apparatus for recognizing unknown input speech and, more particularly, to a large vocabulary speech recognition method and apparatus which permit recognition of a large number of words.

For large vocabulary speech recognition, a method that relies on triphone HMMs (Hidden Markov Models) is widely used. Specifically, this method uses "triphone units" as recognition units, each of which is prepared for a phoneme together with the adjacent phonemes that surround it in a word (or sentence). The "triphone HMM" is detailed in "Fundamentals of Speech Recognition, Part I, Part II, NTT Advanced Technology Co., Ltd, ISBN-4-900886-01-7" or "Fundamentals of Speech Recognition, Prentice Hall, ISBN-0-13-055157-2".

In speech recognition based on triphone HMMs, however, as many different HMMs as the cube of the number of different phonemes are involved, and it is difficult to estimate all the triphone HMMs accurately. To reduce the number of different triphone HMMs, top-down or bottom-up clustering or the like is adopted, as detailed in the references noted above. Where the number of HMMs is reduced, however, it is no longer possible to guarantee that the HMMs themselves are the best fit. In addition, a further problem is posed in that the clustering must resort to unreliable knowledge concerning phonemes.

An object of the present invention is to provide a method of and an apparatus for large vocabulary speech recognition which permit highly accurate speaker-independent speech recognition without setting triphones or like recognition units, while also taking the environment dependency of phonemes into consideration.

According to an aspect of the present invention, there is provided a speech recognition method of recognizing unknown input speech expressed as a feature vector time series, comprising the steps of: storing phoneme symbol series of a plurality of recognition subject words, probabilities of N by N mutual state transitions of N states given sequential numbers, and phoneme symbol output probabilities and feature vector output probabilities associated with the individual state transitions; calculating, from an ergodic hidden Markov model, probabilities of sets of the feature vector time series of the unknown input speech and phoneme symbol series of provisional recognition subject words; and outputting a maximum probability word among all the recognition subject words.

According to another aspect of the present invention, there is provided a speech recognition method of recognizing unknown input speech expressed as a feature vector time series, comprising the steps of: storing phoneme symbol series of a plurality of recognition subject words, probabilities of N by N mutual state transitions of N states given sequential numbers, phoneme symbol output probabilities and feature vector output probabilities associated with the individual state transitions, and speaker's cluster numbers; and outputting a maximum probability word among all the recognition subject words.

According to another aspect of the present invention, there is provided a speech recognition apparatus for recognizing unknown input speech expressed as a feature vector time series, comprising: a word dictionary unit for storing a plurality of phoneme symbol series of a plurality of recognition subject words; a transition probability memory unit for storing transition probabilities associated with N by N mutual state transitions of N states given sequential numbers; an output probability memory unit for storing phoneme symbol output probabilities and feature vector output probabilities associated with the individual state transitions; a word comparing unit for calculating probabilities of sets of the feature vector time series of the unknown input speech and phoneme symbol series of provisional recognition subject words; and a recognition result output unit for outputting a maximum probability word among all the recognition subject words as a recognition result.

According to still another aspect of the present invention, there is provided a speech recognition apparatus for recognizing unknown input speech expressed as a feature vector time series, comprising: a word dictionary unit for storing phoneme symbol series of a plurality of recognition subject words; a transition probability memory unit for storing transition probabilities associated with N by N mutual state transitions of N states given serial numbers; an output probability memory unit for storing phoneme symbol output probabilities and feature vector output probabilities associated with the individual state transitions, and speaker's cluster numbers; a word comparing unit for calculating probabilities of sets of the feature vector time series of the unknown input speech and phoneme symbol series of provisional recognition subject words; and a recognition result output unit for outputting a maximum probability word among all the recognition subject words and speaker's cluster numbers as a recognition result.

The phoneme symbol is a symbol by which a recognition subject word is defined uniquely and unambiguously, and is, for instance, a syllable.

According to the present invention, speaker's cluster numbers associated with the respective state transitions may also be stored, and probabilities may be calculated for sets of the feature vector time series of the unknown input speech, the phoneme symbol series of provisional recognition subject words and provisional speaker's cluster numbers, thereby outputting a maximum probability word among all the recognition subject words and speaker's cluster numbers.

The method of and apparatus for speech recognition according to the present invention differ greatly from the prior art in that, while in the prior art method feature vectors alone are provided in the HMMs, according to the present invention phoneme symbols are also provided in the HMM, and speaker's cluster numbers may further be provided in the HMM. Furthermore, in the prior art a word HMM is constructed as a reference pattern for each recognition subject word by connecting together triphone HMMs, whereas according to the present invention a single ergodic HMM is used as a common reference pattern for all recognition subject words. That is, according to the present invention, natural and common use of model parameters is realized.

Other objects and features will be clarified from the following description with reference to the attached drawings.

Fig. 1 shows a block diagram of a speech recognition apparatus according to an embodiment of the present invention;

Fig. 2 shows the probability of state transition from state 1 to state 2; and

Figs. 3 and 4 are flow charts illustrating a specific example of the routine.

Preferred embodiments of the present invention will now be described with reference to the drawings.

An embodiment of the speech recognition apparatus according to the invention is shown in Fig. 1. The speech recognition apparatus, which can recognize unknown input speech expressed as a feature vector time series, comprises a word dictionary unit 10 for storing phoneme symbol series of a plurality of recognition subject words, a transition probability memory unit 20 for storing transition probabilities associated with the N x N mutual state transitions of N states given sequential numbers, an output probability memory unit 30 for storing phoneme symbol output probabilities and feature vector output probabilities associated with the respective state transitions, a word comparing unit 40 for calculating probabilities of sets of the unknown speech feature vector time series and hypothetical recognition subject words, and a recognition result output unit 50 for providing the highest-probability word among all the recognition subject words as the result of recognition.

The input speech is expressed as the time series X

\[ X = x_1 x_2 \cdots x_t \cdots x_T \]

of feature vectors $x_t$, where the feature vector $x_t$ is, for instance, a 10-dimensional cepstrum vector, the subscript t being a natural number representing sequential time.

In the word dictionary unit 10, phoneme symbol series of recognition subject words are stored. It suffices for the phoneme symbol to be a symbol of a unit smaller than a word, for instance a syllable, by which a recognition subject word can be defined uniquely and unambiguously.

The m-th recognition subject word is expressed as $w_m$, and its phoneme symbol series is expressed as

\[ w_m = p_1 p_2 \cdots p_{K_m} \]

where $K_m$ represents the length of the phoneme symbol series. The total number of phoneme symbols is $N_p$, and these phoneme symbols are given serial numbers as shown in Table 1.



TABLE 1

Number    Phoneme Symbol
1         a
2         i
3         u
4         e
5         o
6         k
...       ...
Np        ...

For example, for a recognition subject word given by the phonemes "akai", $p_1 = 1$, $p_2 = 6$, $p_3 = 1$, $p_4 = 2$, and $K_m = 4$. The total number of recognition subject words is $N_w$. While in this embodiment phoneme symbols are used to express words, it is also possible to use other symbol systems such as syllables.
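The numbering in Table 1 and the "akai" example can be illustrated with a short sketch. The dictionary name `PHONEME_NUMBERS` and the function `encode_word` are hypothetical, only the six symbols listed in Table 1 are included, and each character is assumed to be one phoneme symbol.

```python
# Hypothetical phoneme inventory: only the first six entries of Table 1.
PHONEME_NUMBERS = {"a": 1, "i": 2, "u": 3, "e": 4, "o": 5, "k": 6}


def encode_word(phonemes):
    """Map a sequence of phoneme symbols to serial numbers p_1 ... p_Km."""
    return [PHONEME_NUMBERS[p] for p in phonemes]


# Assumption: each character of the string is one phoneme symbol.
series = encode_word("akai")
print(series, len(series))  # prints: [1, 6, 1, 2] 4
```

The printed list corresponds to $p_1 \ldots p_4$ and the printed length to $K_m$.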

The HMM employed for speech recognition in this embodiment is an ergodic HMM, which uses an ergodic Markov chain. The ergodic HMM is detailed in the literature noted above. Fig. 2 is a view illustrating the ergodic HMM, which will now be described. Specifically, states 1 and 2 and all transitions associated with these states are shown. For example, $a_{12}$ in Fig. 2 represents the probability of state transition from state 1 to state 2. In the following, the general case is considered in which an ergodic HMM constituted by $N_s$ states and the mutual state transitions associated therewith is employed.

In the transition probability memory unit 20, the probabilities of the ergodic HMM state transitions are stored. The probability of transition from the i-th to the j-th state is expressed as $a_{ij}$. The probabilities $a_{ij}$ meet the conditions that their values are at least zero and that, for each state i, the sum of their values is 1, as shown by the following formulas.

\[ a_{ij} \ge 0, \qquad \sum_{j=1}^{N_s} a_{ij} = 1 \]

The initial probabilities of the states are also stored in the transition probability memory unit 20. The initial probability of the i-th state is expressed as $\pi_i$. The initial probabilities $\pi_i$ meet the following conditions.

\[ \pi_i \ge 0, \qquad \sum_{i=1}^{N_s} \pi_i = 1 \]
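The two sets of conditions above can be checked numerically. The following is a minimal sketch, assuming the transition probabilities are held as a list of rows `a[i][j]` and the initial probabilities as a list `pi`; the function name `is_stochastic` and the numerical values are illustrative only.

```python
import math


def is_stochastic(a, pi, tol=1e-9):
    """Check a_ij >= 0, each row of a sums to 1, pi_i >= 0, and pi sums to 1."""
    rows_ok = all(
        all(v >= 0.0 for v in row) and math.isclose(sum(row), 1.0, abs_tol=tol)
        for row in a
    )
    pi_ok = all(v >= 0.0 for v in pi) and math.isclose(sum(pi), 1.0, abs_tol=tol)
    return rows_ok and pi_ok


# Illustrative two-state ergodic model (values are made up).
a = [[0.7, 0.3],
     [0.4, 0.6]]
pi = [0.5, 0.5]
print(is_stochastic(a, pi))  # prints: True
```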

In the output probability memory unit 30, the phoneme symbol output probabilities and feature vector output probabilities associated with the state transitions are stored. The phoneme symbol output probabilities associated with the state transition from the i-th to the j-th state are expressed as $f_{ij}(p)$, where p represents the p-th phoneme symbol. Since the number of different phoneme symbols is $N_p$,

\[ \sum_{p=1}^{N_p} f_{ij}(p) = 1 \]

For example, $f_{ij}(1)$ represents the probability of output of phoneme symbol "a" in association with the state transition from the i-th to the j-th state.

Feature vector output probabilities associated with the state transition from the i-th to the j-th state are expressed as $g_{ij}(x)$. The feature vector output probabilities $g_{ij}(x)$ are herein expressed as a multi-dimensional Gaussian distribution,

\[ g_{ij}(x) = \frac{1}{\sqrt{(2\pi)^D \, |\Sigma_{ij}|}} \exp\left[ -\frac{1}{2} (x - \mu_{ij})^{\mathsf{T}} \, \Sigma_{ij}^{-1} \, (x - \mu_{ij}) \right] \]

where D is the dimension of the feature vectors, $\mu_{ij}$ is the mean vector, and $\Sigma_{ij}$ is the covariance matrix.
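Because the word comparison works with logarithmic probabilities, it is convenient to evaluate $\log g_{ij}(x)$ directly. The sketch below does this under a simplifying assumption that is not in the text: the covariance matrix is taken to be diagonal, so that no matrix inversion is needed. The function name `log_gaussian_diag` is hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a D-dimensional Gaussian with diagonal covariance.

    Assumption: the covariance matrix is diagonal with entries `var`;
    the text itself allows a full covariance matrix.
    """
    d = len(x)
    log_det = sum(math.log(v) for v in var)  # log |Sigma| for a diagonal Sigma
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


# At the mean of a 2-dimensional unit-variance Gaussian the log density is -log(2*pi).
print(log_gaussian_diag([1.0, 2.0], [1.0, 2.0], [1.0, 1.0]))
```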

The word comparing unit 40 calculates the probabilities (or likelihoods) of the recognition subject words. The logarithmic value of the probability $P(w_m, X)$ of the m-th recognition subject word $w_m$ is calculated as follows. As noted before,

\[ w_m = p_1 p_2 \cdots p_k \cdots p_{K_m}, \quad \text{and} \quad X = x_1 x_2 \cdots x_t \cdots x_T. \]

The partial sum of logarithmic probabilities is defined by the initialization

\[ \phi_0(i, 1) = \log[\pi_i], \]

\[ \phi_0(i, k) = -\infty \qquad (1 < k \le K_m) \]

and the recurrence formula

\[ \phi_t(i, k) = \max_{k' = k-1,\, k} \; \max_{j = 1, \ldots, N_s} \left\{ \phi_{t-1}(j, k') + \log[a_{ji}] + \log[f_{ji}(p_k)] + \log[g_{ji}(x_t)] \right\} \]

\[ (1 \le t \le T,\ 1 \le i \le N_s,\ 1 \le k \le K_m) \]
Using the above initialization and recurrence formula, the word comparing unit 40 calculates the partial sum $\phi_t(i, k)$ of logarithmic probabilities as a three-dimensional array specified by the three subscripts of the t-th time, i-th state and k-th phoneme symbol, for all times $1 \le t \le T$, all states $1 \le i \le N_s$ and all phoneme symbols $1 \le k \le K_m$ in the recognition subject word $w_m$.

From the partial sum $\phi_T(i, K_m)$ of logarithmic probabilities thus obtained, the logarithmic value of the probability $P(w_m, X)$ of the m-th recognition subject word $w_m$ is obtained as:

\[ \log[P(w_m, X)] = \max_{i = 1, \ldots, N_s} \left[ \phi_T(i, K_m) \right] \]
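The initialization, recurrence and termination described above can be sketched as follows. This is an illustrative sketch rather than the routine of Figs. 3 and 4: the function name, the 0-based indexing, the nested-list layout of the tables, and the precomputation of $\log g_{ji}(x_t)$ per frame are all assumptions, and for k = 1 the predecessor k' = 0 is simply skipped.

```python
import math

NEG_INF = float("-inf")


def word_log_prob(word, log_pi, log_a, log_f, log_g):
    """Return log P(w_m, X) for one word using the max recurrence.

    word   : 0-based phoneme indices p_1 ... p_Km
    log_pi : log_pi[i]      = log pi_i
    log_a  : log_a[j][i]    = log a_ji (transition from state j to state i)
    log_f  : log_f[j][i][p] = log f_ji(p)
    log_g  : log_g[t][j][i] = log g_ji(x_t), precomputed for each frame
    """
    ns, km = len(log_pi), len(word)
    # Initialization: phi_0(i, 1) = log pi_i, phi_0(i, k) = -inf for k > 1.
    phi = [[log_pi[i] if k == 0 else NEG_INF for k in range(km)]
           for i in range(ns)]
    for frame in log_g:                      # t = 1 ... T
        new = [[NEG_INF] * km for _ in range(ns)]
        for i in range(ns):
            for k in range(km):
                for kp in (k - 1, k):        # k' = k-1 or k
                    if kp < 0:
                        continue             # assumption: no predecessor before p_1
                    for j in range(ns):
                        score = (phi[j][kp] + log_a[j][i]
                                 + log_f[j][i][word[k]] + frame[j][i])
                        if score > new[i][k]:
                            new[i][k] = score
        phi = new
    # Termination: maximum over states at the last phoneme position K_m.
    return max(phi[i][km - 1] for i in range(ns))


# Toy model: one state, two phoneme symbols, two frames (all values made up).
log_half = math.log(0.5)
result = word_log_prob(
    word=[0, 1],
    log_pi=[0.0],
    log_a=[[0.0]],
    log_f=[[[log_half, log_half]]],
    log_g=[[[0.0]], [[0.0]]],
)
print(result)
```

Only the previous time slice of the array is kept here, which is enough when just the final value $\phi_T(i, K_m)$ is needed.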

The word comparing unit 40 calculates the logarithmic probabilities of all the recognition subject words. Figs. 3 and 4 are flow charts illustrating a specific example of the routine of the above process. In steps 101 to 108, the partial sum of logarithmic probabilities is initialized; in steps 109 to 133, the logarithmic value L of the probability is calculated; and in step 134, the logarithmic value L is outputted. In the initialization routine, in step 102 the logarithm of the i-th initial probability $\pi_i$ is substituted into $\phi(0, i, 1)$, corresponding to t = 0 and k = 1. For $\phi(0, i, k)$ with k at least 2, $-\infty$ is substituted in step 104. Since logarithmic probabilities are dealt with here, $-\infty$ corresponds to a probability (antilogarithm) of zero. Likewise, in step 113 $-\infty$ is substituted into $\phi(t, i, k)$ as the logarithm of zero.

When the probabilities of all the recognition subject words have been obtained in the above way, the recognition result output unit 50 outputs, as the recognition result, the word $w_{\hat{m}}$ which gives the maximum probability among these probabilities. That is,

\[ \hat{m} = \arg\max_{m} \left[ \log P(w_m, X) \right] \]
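The selection made by the recognition result output unit 50 amounts to an argmax over the per-word logarithmic probabilities. A minimal sketch, assuming the scores are held in a dictionary keyed by word; the function name `recognize`, the words other than "akai", and all score values are made up for illustration.

```python
def recognize(log_probs):
    """Return the word whose logarithmic probability is largest."""
    return max(log_probs, key=log_probs.get)


# Hypothetical log P(w_m, X) values for three recognition subject words.
scores = {"akai": -42.7, "aoi": -51.3, "kuroi": -48.9}
print(recognize(scores))  # prints: akai
```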



While a preferred embodiment of the present invention has been described, it is by no means limitative. For example, while in the above embodiment the HMM output is provided by having feature vector output probabilities and phoneme symbol output probabilities associated with the state transitions, it is also possible to have speaker's cluster number output probabilities associated with the state transitions.

Where speaker's cluster number output probabilities are associated with the state transitions, they are expressed as $h_{ij}(q)$. Where the total number of speaker's clusters is $Q$, we have

\[ \sum_{q=1}^{Q} h_{ij}(q) = 1 \]

The speaker's cluster numbers are stored in the output probability memory unit 30. The initialization and recurrence formula noted above are extended, with the partial sum of logarithmic probabilities as a four-dimensional array, to

\[ \phi_0(i, 1, q) = \log[\pi_i], \]

\[ \phi_0(i, k, q) = -\infty \qquad (1 < k \le K_m,\ 1 \le q \le Q) \]

and

\[ \phi_t(i, k, q) = \max_{k' = k-1,\, k} \; \max_{j = 1, \ldots, N_s} \left\{ \phi_{t-1}(j, k', q) + \log[a_{ji}] + \log[f_{ji}(p_k)] + \log[g_{ji}(x_t)] + \log[h_{ji}(q)] \right\} \]

\[ (1 \le t \le T,\ 1 \le i \le N_s,\ 1 \le k \le K_m,\ 1 \le q \le Q) \]

From the partial sum of logarithmic probabilities thus obtained, the logarithmic value of the probability of the recognition subject word $w_m$ is obtained as

\[ \log[P(w_m, X)] = \max_{i = 1, \ldots, N_s} \; \max_{q = 1, \ldots, Q} \left[ \phi_T(i, K_m, q) \right] \]
These calculations are executed in the word comparing unit 40. 
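One way to picture the four-dimensional calculation is to repeat the word recurrence once per speaker's cluster number q and keep the best result. The sketch below rests on assumptions that the text does not spell out: the cluster number q is held fixed along a path, a term $\log h_{ji}(q)$ is added on every transition, and the tables are 0-based nested lists. The function name is hypothetical.

```python
import math

NEG_INF = float("-inf")


def word_log_prob_with_clusters(word, log_pi, log_a, log_f, log_g, log_h):
    """Return (log P(w_m, X), best cluster index) over all clusters q.

    log_h[j][i][q] = log h_ji(q); the other tables are laid out as
    log_a[j][i], log_f[j][i][p] and log_g[t][j][i].
    """
    ns, km = len(log_pi), len(word)
    nq = len(log_h[0][0])
    best, best_q = NEG_INF, None
    for q in range(nq):                      # assumption: q is fixed along a path
        phi = [[log_pi[i] if k == 0 else NEG_INF for k in range(km)]
               for i in range(ns)]
        for frame in log_g:                  # t = 1 ... T
            new = [[NEG_INF] * km for _ in range(ns)]
            for i in range(ns):
                for k in range(km):
                    for kp in (k - 1, k):    # k' = k-1 or k
                        if kp < 0:
                            continue
                        for j in range(ns):
                            score = (phi[j][kp] + log_a[j][i]
                                     + log_f[j][i][word[k]]
                                     + frame[j][i] + log_h[j][i][q])
                            if score > new[i][k]:
                                new[i][k] = score
            phi = new
        score_q = max(phi[i][km - 1] for i in range(ns))
        if score_q > best:
            best, best_q = score_q, q
    return best, best_q


# Toy model: one state, one phoneme, one frame, two clusters (values made up).
log_prob, cluster = word_log_prob_with_clusters(
    word=[0],
    log_pi=[0.0],
    log_a=[[0.0]],
    log_f=[[[math.log(0.5)]]],
    log_g=[[[0.0]]],
    log_h=[[[math.log(0.2), math.log(0.8)]]],
)
print(log_prob, cluster)
```

Returning the cluster index alongside the score reflects the idea that the speaker's cluster is determined automatically as a by-product of recognition.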

The recognition result output unit 50 outputs, as the recognition result, the word of maximum probability among all the recognition subject words and speaker's cluster numbers.

By adding the speaker's cluster numbers to the ergodic HMM output, it is possible to perform speech recognition with automatic determination of the optimum speaker characteristics, even in speaker-independent speech recognition.

As has been described in the foregoing, according to the present invention, by using a single ergodic HMM for outputting phoneme symbol series and feature vector series, it is possible to realize a large vocabulary speech recognition apparatus which does not require setting "triphones" or like recognition units and which also takes the environment dependency of phonemes into consideration. In addition, by adding speaker's cluster numbers to the output of the ergodic HMM, it is possible to realize an apparatus which can recognize speech with automatic determination of the optimum speaker characteristics, even in speaker-independent speech recognition.

Changes in construction will occur to those skilled in the art, and various apparently different modifications and embodiments may be made without departing from the scope of the present invention. The matter set forth in the foregoing description and accompanying drawings is offered by way of illustration only. It is therefore intended that the foregoing description be regarded as illustrative rather than limiting.

Claims 

1. A speech recognition method of recognizing unknown input speech expressed as a feature vector time series, comprising the steps of:

    storing phoneme symbol series of a plurality of recognition subject words, probabilities of N by N mutual state transitions of N states given sequential numbers, and phoneme symbol output probabilities and feature vector output probabilities associated with the individual state transitions;

    calculating, from an ergodic hidden Markov model, probabilities of sets of the feature vector time series of the unknown input speech and phoneme symbol series of provisional recognition subject words; and

    outputting a maximum probability word among all the recognition subject words.

2. A speech recognition method of recognizing unknown input speech expressed as a feature vector time series, comprising the steps of:

    storing phoneme symbol series of a plurality of recognition subject words, probabilities of N by N mutual state transitions of N states given sequential numbers, phoneme symbol output probabilities and feature vector output probabilities associated with the individual state transitions, and speaker's cluster numbers; and

    outputting a maximum probability word among all the recognition subject words.

3. The speech recognition method as set forth in claim 1 or 2, wherein the phoneme symbol is a symbol by which a recognition subject word is defined uniquely and unambiguously.

4. The speech recognition method as set forth in claim 1 or 2, wherein the phoneme symbol is a syllable.

5. A speech recognition apparatus for recognizing unknown input speech expressed as a feature vector time series, comprising:

    a word dictionary unit for storing a plurality of phoneme symbol series of a plurality of recognition subject words;

    a transition probability memory unit for storing transition probabilities associated with N by N mutual state transitions of N states given sequential numbers;

    an output probability memory unit for storing phoneme symbol output probabilities and feature vector output probabilities associated with the individual state transitions;

    a word comparing unit for calculating probabilities of sets of the feature vector time series of the unknown input speech and phoneme symbol series of provisional recognition subject words; and

    a recognition result output unit for outputting a maximum probability word among all the recognition subject words as a recognition result.

6. A speech recognition apparatus for recognizing unknown input speech expressed as a feature vector time series, comprising:

    a word dictionary unit for storing phoneme symbol series of a plurality of recognition subject words;

    a transition probability memory unit for storing transition probabilities associated with N by N mutual state transitions of N states given serial numbers;

    an output probability memory unit for storing phoneme symbol output probabilities and feature vector output probabilities associated with the individual state transitions, and speaker's cluster numbers;

    a word comparing unit for calculating probabilities of sets of the feature vector time series of the unknown input speech and phoneme symbol series of provisional recognition subject words; and

    a recognition result output unit for outputting a maximum probability word among all the recognition subject words and speaker's cluster numbers as a recognition result.

7. The speech recognition apparatus as set forth in claim 5 or 6, wherein the phoneme symbol is a symbol by which a recognition subject word is defined uniquely and unambiguously.

8. The speech recognition apparatus as set forth in claim 5 or 6, wherein the phoneme symbol is a syllable.